A Segmental HMM for Speech Waveforms
Authors
Kannan Achan, Sam Roweis, Brendan Frey (Machine Learning Group, University of Toronto)
Abstract
We present a purely time-domain approach to speech processing that identifies waveform samples at the boundaries between glottal pulse periods (in voiced speech) or at the boundaries between unvoiced segments. An efficient algorithm for inferring these boundaries is derived from a simple probabilistic generative model of speech, and state-of-the-art results are presented on pitch tracking, voiced/unvoiced detection and timescale modification.
1 Speech Segments in the Time Domain
Processing of speech signals directly in the time domain is commonly regarded as difficult and unstable, due to the fact that perceptually very similar utterances exhibit very large variability in their raw waveforms. As a result, by far the most common preprocessing step in speech systems is to convert the raw waveform into a time-frequency representation, using a variety of spectral analysis and filterbank techniques.
In this paper we pursue a purely time-domain approach to speech processing in which we identify the samples at the boundaries between glottal pulse periods (in voiced speech) or at the boundaries between unvoiced segments of similar spectral shape ("colour").
Having identified these segment boundaries, we can perform a variety of important low-level speech analysis operations directly and conveniently. For example, we make a voiced/unvoiced decision on each segment by examining the periodicity of the waveform in that segment only. In voiced segments we can estimate the pitch as the reciprocal of the segment length. Timescale modification without pitch or formant distortion can be achieved by stochastically eliminating or replicating segments directly in the time domain.
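To make the two segment-level operations above concrete, here is a minimal Python sketch. It assumes the segment boundaries are already known (the inference step is not shown) and that every segment is voiced. The function names `segment_pitches` and `timescale`, the `rate` parameter and the seeded random generator are illustrative assumptions, not part of the paper.

```python
import random


def segment_pitches(boundaries, sample_rate):
    """Pitch estimate in Hz per segment: the reciprocal of the segment duration.

    `boundaries` holds sample indices; segment k spans
    boundaries[k] .. boundaries[k + 1] - 1.
    """
    return [sample_rate / (end - start)
            for start, end in zip(boundaries, boundaries[1:])]


def timescale(signal, boundaries, rate, seed=0):
    """Lengthen (rate > 1) or shorten (rate < 1) a signal segment by segment.

    Whole segments are replicated or dropped at random, so the samples
    inside each surviving segment, and hence its pitch, are untouched.
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    whole = int(rate)
    frac = rate - whole
    out = []
    for start, end in zip(boundaries, boundaries[1:]):
        # `whole` copies, plus one more with probability `frac`,
        # so the expected number of copies equals `rate`.
        copies = whole + (1 if rng.random() < frac else 0)
        out.extend(signal[start:end] * copies)
    return out


# Toy input: three 80-sample segments at an assumed 8 kHz sampling rate.
toy_signal = [float(i % 80) for i in range(240)]
toy_boundaries = [0, 80, 160, 240]
toy_pitches = segment_pitches(toy_boundaries, 8000)
toy_doubled = timescale(toy_signal, toy_boundaries, 2.0)
```

With 80-sample segments at 8 kHz, each segment lasts 10 ms, so the pitch estimate is 100 Hz. A `rate` of 2.0 replicates every segment exactly once more, while a fractional `rate` makes the choice per segment at random.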
More sophisticated operations, such as pitch modification, gender and voice conversion, and companding (volume equalization), are also naturally performed by operating on waveform segments one by one, without the need for a cepstral or other such representation.
The computational challenge with this approach is in efficiently and robustly identifying the segment boundaries across silence, unvoiced and voiced segments. In this paper we introduce a segmental Hidden Markov Model, defined on variable-length sections of the time-domain waveform, and show that performing inference in this model allows us to identify segment boundaries and achieve excellent results on the speech processing tasks described above.
2 A probabilistic generative model of time-domain speech segments
The goal of our algorithm is to break the time-domain speech signal $s_1, \ldots, s_N$ into a set of segments, each of which corresponds to a glottal pulse period or a segment of unvoiced colored noise. Let $b_k$ denote the time index of the beginning of the $k$th segment and $\mathbf{s}_k = (s_{b_k}, \ldots, s_{b_{k+1}-1})$ denote the waveform in the $k$th segment, where $k = 1, \ldots, K$ indexes segments. Our algorithm searches for the segment boundaries $b_1, b_2, \ldots, b_{K+1}$ so that each segment can be accurately modeled as a time-warped, amplitude-scaled and amplitude-shifted version of the previous segment. We denote the transformation used to map segment $\mathbf{s}_{k-1}$ into segment $\mathbf{s}_k$ by $T_k$. (A similar idea is used in [1] to cluster patterns in a way that is invariant to a set of transformations.) Given the segment boundaries $b_1, \ldots, b_{K+1}$ and the transformations $T_1, \ldots, T_K$ we ...
† Thanks to John Hopfield.
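The idea of modeling each segment as a time-warped, amplitude-scaled and amplitude-shifted version of the previous one can be illustrated with a small Python sketch. The excerpt does not specify the family of transformations, so this sketch assumes a uniform (linear-resampling) time warp and fits the scale and shift by ordinary least squares. The names `warp` and `fit_transform` are hypothetical, and this is not the authors' inference procedure.

```python
def warp(prev, length):
    """Linearly resample `prev` to `length` samples (assumed uniform time warp)."""
    if length == 1 or len(prev) == 1:
        return [prev[0]] * length
    step = (len(prev) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * step
        lo = min(int(pos), len(prev) - 2)  # left neighbour used for interpolation
        frac = pos - lo
        out.append((1 - frac) * prev[lo] + frac * prev[lo + 1])
    return out


def fit_transform(prev, cur):
    """Least-squares scale `a` and shift `c` such that cur ~ a * warp(prev) + c.

    Returns (a, c, err), where err is the squared residual; a small err
    means `cur` is well explained as a transformed copy of `prev`.
    """
    w = warp(prev, len(cur))
    n = len(cur)
    mean_w = sum(w) / n
    mean_c = sum(cur) / n
    var = sum((x - mean_w) ** 2 for x in w)
    cov = sum((x - mean_w) * (y - mean_c) for x, y in zip(w, cur))
    a = cov / var if var > 0 else 0.0
    c = mean_c - a * mean_w
    err = sum((y - (a * x + c)) ** 2 for x, y in zip(w, cur))
    return a, c, err


# Toy check: the next segment is the previous one stretched from 5 to 9
# samples, doubled in amplitude and shifted by 0.5.
prev_seg = [0.0, 1.0, 0.0, -1.0, 0.0]
cur_seg = [2.0 * x + 0.5 for x in warp(prev_seg, 9)]
a_hat, c_hat, residual = fit_transform(prev_seg, cur_seg)
```

In a boundary search, a residual of this kind could serve as the cost of proposing that a segment ends at a given sample. How the paper actually scores candidate boundaries is not shown in this excerpt.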
Similar resources
Speech enhancement based on hidden Markov model using sparse code shrinkage
This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplace-Gaussian (for clean speech and noise respectively) combination in the HMM ...
HMM composition of segmental unit input HMM for noisy speech recognition
For robust speech recognition in noisy environments, various methods have been studied. In this paper, we apply parallel model combination (PMC) for segmental unit input HMM to recognize corrupted speech in additive noise. Since several successive frames are combined and treated as an input vector in segmental unit input modeling, the increased dimension of vector degrades the precision in esti...
Speaker transformation using sentence HMM based alignments and detailed prosody modification
This paper presents several improvements to our voice conversion system which we refer to as Speaker Transformation Algorithm using Segmental Codebooks (STASC)[2]. First, a new concept, sentence HMM, is introduced for the alignment of speech waveforms sharing the same text. This alignment technique allows reliable and high resolution mapping between two speech waveforms. In addition, it is obse...
Speaker recognition using a trajectory-based segmental HMM
A segmental HMM is an HMM whose states are associated with sequences of acoustic feature vectors (or segments), rather than individual vectors. By treating segments as homogeneous units it is possible, for example, to develop better models of speech dynamics. This paper begins by describing a type of segmental HMM in which the relationship between the state and acoustic level descriptions of a s...
Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition
In this paper, a theoretical framework for Bayesian adaptive learning of discrete HMM and semi-continuous one with Gaussian mixture state observation densities is presented. Corresponding to the well-known Baum-Welch and segmental k-means algorithms, respectively, for HMM training, formulations of MAP (maximum a posteriori) and segmental MAP estimation of HMM parameters are developed. Furthermore, a c...